Abstract

Online learning algorithms are fast, memory-efficient, easy to implement, and
applicable to many prediction problems, including classification, regression,
and ranking. Several online algorithms were proposed in the past few decades,
some based on additive updates, like the Perceptron, and some on multiplicative
updates, like Winnow. Online mirror descent is a general prediction strategy
offering a unified viewpoint on the design and the analysis of online
algorithms: most first-order algorithms can indeed be obtained as special cases
of mirror descent.

We generalize online mirror descent to sequences of time-varying regularizers
with generic updates. Unlike standard mirror descent, our more general
formulation also captures second-order algorithms, algorithms for composite
losses, and algorithms for adaptive filtering. Moreover, we recover, and
sometimes improve, known regret bounds by instantiating our analysis on
specific regularizers. Finally, we show the power of our approach by deriving a
new second-order algorithm with a regret bound invariant with respect to
arbitrary rescalings of individual features.



1 Introduction 



Online learning provides a scalable and flexible approach for the solution of
a wide range of prediction problems, including classification, regression,
ranking, and portfolio management. Popular online algorithms for classification
include the standard Perceptron and its many variants, such as kernel
Perceptron [Freund and Schapire, 1999], p-norm Perceptron [Gentile, 2003], and
Passive-Aggressive [Crammer et al., 2006]. These algorithms have well-known
counterparts for regression problems, such as the Widrow-Hoff algorithm and its
p-norm generalization. Other online algorithms, with properties different from
those of the standard Perceptron, are based on exponential (rather than
additive) updates, such as Winnow [Littlestone, 1988] for classification and
Exponentiated Gradient [Kivinen and Warmuth, 1997] for regression.



Whereas these are all essentially variants of stochastic gradient descent
[Tsypkin, 1971], in the last decade many more complex algorithms have been
proposed. In particular, for the so-called composite setting, that is, when a
fixed regularization term is added to the loss functions, complex ad-hoc
algorithms have been proposed that take advantage of the structure of the
problem [Duchi and Singer, 2009; Xiao, 2010; Duchi et al., 2010]. Second-order
algorithms use higher-order information from the input features. These include
the Vovk-Azoury-Warmuth algorithm for regression [Vovk, 2001; Azoury and
Warmuth, 2001] (see also the work of Forster [1999]), the second-order
Perceptron [Cesa-Bianchi et al., 2005], the CW/AROW algorithms [Dredze et al.,
2008; Crammer et al., 2008, 2009, 2012], and the algorithms proposed by Duchi
et al. [2011] and McMahan and Streeter [2010].



Recently, online convex optimization has been proposed as a common unifying
framework for designing and analyzing online algorithms. In particular, online
mirror descent (OMD) is a general online convex optimization algorithm which is
parametrized by a regularizer, i.e., any strongly convex function. By
appropriately choosing the regularizer, most first-order online learning
algorithms are recovered as special cases of OMD. Moreover, performance
guarantees can also be derived simply by instantiating the general OMD bounds
to the specific regularizer being used. The theoretical study of OMD relies on
convex analysis. Warmuth and Jagota [1997] and Kivinen and Warmuth [2001]
pioneered the use of Bregman divergences in the analysis of online algorithms;
see the book by Cesa-Bianchi and Lugosi [2006]. Shalev-Shwartz and Singer
[2007], Shalev-Shwartz [2007], and Shalev-Shwartz and Kakade [2009] showed a
different approach based on a primal-dual method. Starting from the work of
Kakade et al. [2009], it is now clear that many instances of OMD can be easily
analyzed using only a few basic convex duality properties. See the recent
survey by Shalev-Shwartz [2012] for a lucid description of these developments.



In this paper we extend and generalize the theoretical framework of Kakade
et al. [2009]. In particular, we allow OMD to use a sequence of time-varying
regularizers and general updates, not necessarily the sub-gradient of the loss
functions. These two properties are known to be the key to obtaining
second-order algorithms, and indeed we recover the Vovk-Azoury-Warmuth, the
second-order Perceptron, and the AROW algorithm as special cases, with a unique
and simple analysis and with a slightly improved theoretical guarantee for
AROW. Our generalized OMD also captures the efficient variants of these
algorithms that only use the diagonal elements of the second-order information
matrix, a result which was not within reach of previous techniques. Moreover,
we show that a proper choice of the time-varying regularizer allows us to cope
with the composite setting too, without the need for ad-hoc proof techniques.

Our framework improves on previous results even in the case of first-order
algorithms. For example, although aggressive algorithms for binary
classification often exhibit better empirical performance than their
conservative counterparts, a theoretical explanation of this behavior has so
far remained elusive. Using our refined analysis, we are able to prove the
first bound for Passive-Aggressive (PA-I) that is never worse (and sometimes
better) than the Perceptron bound.

Time-varying regularizers can also be used to perform other types of adap- 
tation to the sequence of observed data. We give a concrete example by intro- 
ducing a new adaptive regularizer corresponding to a weighted version of the 
p-norm regularizer. The resulting regret bound enjoys the property of being 
invariant with respect to arbitrary rescalings of individual features. Moreover, 
in the case of sparse targets the bound is better than that of OMD with 1-norm 
regularization, which is the standard regularizer for the sparse target assump- 
tion. 



2 Online convex programming 

Let $X$ be some Euclidean space (a finite-dimensional linear space over the
reals equipped with an inner product $\langle\cdot,\cdot\rangle$). In the online
convex optimization protocol an algorithm sequentially chooses elements from
$S \subseteq X$, each time incurring a certain loss. At each step
$t = 1, 2, \dots$ the algorithm chooses $w_t \in S$ and then observes a convex
loss function $\ell_t : S \to \mathbb{R}$. The value $\ell_t(w_t)$ is the loss
of the learner at step $t$, and the goal is to control the regret,

\[ R_T(u) = \sum_{t=1}^T \ell_t(w_t) - \sum_{t=1}^T \ell_t(u) \]

for all $u \in S$ and for any sequence of convex loss functions $\ell_t$. An
important application domain for this protocol is sequential linear
regression/classification. In this case, there is a fixed and given loss
function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and a fixed but
unknown sequence $(x_1, y_1), (x_2, y_2), \dots$ of examples
$(x_t, y_t) \in X \times \mathbb{R}$. At each step $t = 1, 2, \dots$ the learner
observes $x_t$ and picks $w_t \in S \subseteq X$. The loss suffered at step $t$
is then defined as $\ell_t(w_t) = \ell(\langle w_t, x_t\rangle, y_t)$. For
example, in regression
$\ell(\langle w, x_t\rangle, y_t) = (\langle w, x_t\rangle - y_t)^2$. In
classification, where $y_t \in \{-1, +1\}$, a typical loss function is the hinge
loss $[1 - y_t\langle w, x_t\rangle]_+$, where $[a]_+ = \max\{0, a\}$. This is a
convex upper bound on the true quantity of interest, namely, the mistake
indicator function $\mathbb{I}_{\{y_t\langle w, x_t\rangle \le 0\}}$.
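The protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`hinge_loss`, `square_loss`, `regret`) and the toy data are hypothetical and not part of the paper.

```python
def hinge_loss(w, x, y):
    """Hinge loss [1 - y <w, x>]_+ for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)


def square_loss(w, x, y):
    """Square loss (<w, x> - y)^2 used in the regression example."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - y) ** 2


def regret(loss, predictors, examples, u):
    """R_T(u): cumulative loss of the learner minus that of the fixed comparator u."""
    learner = sum(loss(w, x, y) for w, (x, y) in zip(predictors, examples))
    comparator = sum(loss(u, x, y) for x, y in examples)
    return learner - comparator


# Toy sequence: three examples and the three predictors a learner happened to play.
examples = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1)]
predictors = [[0.0, 0.0], [1.0, 0.0], [1.0, -1.0]]
u = [2.0, -1.0]
hinge_regret = regret(hinge_loss, predictors, examples, u)
```

On this toy sequence every prediction has margin zero, so the learner pays hinge loss 1 at each step while the comparator pays nothing.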



2.1 Further notation and definitions 

We now introduce some basic notions of convex analysis that are used in the
paper. We refer to Rockafellar [1970] for definitions and terminology. We
consider functions $f : X \to \mathbb{R}$ that are closed and convex. This is
equivalent to saying that their epigraph $\{(x, y) : f(x) \le y\}$ is a convex
and closed subset of $X \times \mathbb{R}$. The (effective) domain of $f$, that
is the set $\{x \in X : f(x) < \infty\}$, is a convex set whenever $f$ is
convex. We can always choose any $S \subseteq X$ as the domain of $f$ by letting
$f(x) = \infty$ for $x \notin S$.

Given a closed and convex function $f$ with domain $S \subseteq X$, its Fenchel
conjugate $f^* : X \to \mathbb{R}$ is defined as
$f^*(u) = \sup_{v \in S} (\langle v, u\rangle - f(v))$. Note that the domain of
$f^*$ is always $X$. Moreover, one can prove that $f^{**} = f$.

A generic norm of a vector $u \in X$ is denoted by $\|u\|$. Its dual
$\|\cdot\|_*$ is the norm defined as
$\|v\|_* = \sup_u \{\langle u, v\rangle : \|u\| \le 1\}$. The Fenchel-Young
inequality states that $f(u) + f^*(v) \ge \langle u, v\rangle$ for all $v, u$.

A vector $x$ is a subgradient of a convex function $f$ at $v$ if
$f(u) - f(v) \ge \langle u - v, x\rangle$ for any $u$ in the domain of $f$. The
differential set of $f$ at $v$, denoted by $\partial f(v)$, is the set of all
the subgradients of $f$ at $v$. If $f$ is also differentiable at $v$, then
$\partial f(v)$ contains a single vector, denoted by $\nabla f(v)$, which is the
gradient of $f$ at $v$. A consequence of the Fenchel-Young inequality is the
following: for all $x \in \partial f(v)$ we have that
$f(v) + f^*(x) = \langle v, x\rangle$. A function $f$ is $\beta$-strongly convex
with respect to a norm $\|\cdot\|$ if for any $u, v$ in its domain, and any
$x \in \partial f(u)$,

\[ f(v) \ge f(u) + \langle x, v - u\rangle + \frac{\beta}{2}\|u - v\|^2 . \]

The Fenchel conjugate $f^*$ of a $\beta$-strongly convex function $f$ is
everywhere differentiable and $\frac{1}{\beta}$-strongly smooth. This means that
for all $u, v \in X$,

\[ f^*(v) \le f^*(u) + \langle \nabla f^*(u), v - u\rangle + \frac{1}{2\beta}\|u - v\|_*^2 . \]

See also the paper of Kakade et al. [2009] and references therein. A further
property of strongly convex functions $f : S \to \mathbb{R}$ is the following:
for all $u \in X$,

\[ \nabla f^*(u) = \operatorname*{argmax}_{v \in S} \big( \langle v, u\rangle - f(v) \big) . \tag{1} \]

This implies the useful identity

\[ f(\nabla f^*(u)) + f^*(u) = \langle \nabla f^*(u), u\rangle . \tag{2} \]

Strong convexity and strong smoothness are key properties in the design of
online learning algorithms. In the following, we often write $\|\cdot\|_f$ to
denote the norm according to which $f$ is strongly convex.
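As a concrete illustration of (1) and (2), the sketch below uses a regularizer that is not discussed in this section and is chosen here only as an assumption: the negative entropy restricted to the probability simplex. Its conjugate is the log-sum-exp function and the gradient of the conjugate is the softmax map (standard facts from convex analysis). The function names are hypothetical.

```python
import math


def neg_entropy(v):
    """f(v) = sum_i v_i ln v_i on the probability simplex (with 0 ln 0 = 0)."""
    return sum(vi * math.log(vi) for vi in v if vi > 0.0)


def log_sum_exp(theta):
    """Fenchel conjugate f*(theta) = ln sum_i exp(theta_i), computed stably."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))


def softmax(theta):
    """grad f*(theta): the maximizer in (1); it always lies in the simplex S."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [ei / s for ei in e]


theta = [0.3, -1.2, 2.0]
w = softmax(theta)
# Identity (2): f(grad f*(theta)) + f*(theta) = <grad f*(theta), theta>
lhs = neg_entropy(w) + log_sum_exp(theta)
rhs = sum(wi * ti for wi, ti in zip(w, theta))

# Fenchel-Young inequality for some other point v of the simplex.
v = [0.2, 0.5, 0.3]
fenchel_young_gap = neg_entropy(v) + log_sum_exp(theta) - sum(vi * ti for vi, ti in zip(v, theta))
```

The point of the example is that `theta` is an arbitrary vector of the whole space, while `softmax(theta)` is always a feasible point of the domain.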

3 Online Mirror Descent 

We now introduce our main algorithmic tool: a generalization of the standard 
OMD algorithm for online convex programming in which the regularizers may 
change over time. 



Algorithm 1 Online Mirror Descent
1: Parameters: A sequence of strongly convex functions $f_1, f_2, \dots$
   defined on a common domain $S \subseteq X$.
2: Initialize: $\theta_1 = 0 \in X$
3: for $t = 1, 2, \dots$ do
4:   Choose $w_t = \nabla f_t^*(\theta_t)$
5:   Observe $z_t \in X$
6:   Update $\theta_{t+1} = \theta_t + z_t$
7: end for
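A minimal Python sketch of Algorithm 1 follows. The interface is an assumption made for illustration: each regularizer is supplied directly through the map $\theta \mapsto \nabla f_t^*(\theta)$, and the inputs $z_t$ come from a callback. The toy run uses $f_t(w) = \frac{\sqrt{t}}{2}\|w\|_2^2$ (so $\nabla f_t^*(\theta) = \theta/\sqrt{t}$) and $z_t$ equal to minus the gradient of $\frac{1}{2}\|w - c\|_2^2$; none of these choices come from the paper.

```python
import math


def online_mirror_descent(grad_conjugates, observe, dim):
    """Run Algorithm 1.

    grad_conjugates: list of callables; the t-th maps theta_t to w_t = grad f_t^*(theta_t).
    observe: callable (t, w_t) -> z_t, the vector revealed after w_t is chosen.
    Returns the list of iterates w_1, ..., w_T and the final dual vector theta_{T+1}.
    """
    theta = [0.0] * dim                                  # theta_1 = 0
    iterates = []
    for t, grad_conj in enumerate(grad_conjugates, start=1):
        w = grad_conj(theta)                             # choose w_t
        z = observe(t, w)                                # observe z_t
        theta = [th + zi for th, zi in zip(theta, z)]    # theta_{t+1} = theta_t + z_t
        iterates.append(w)
    return iterates, theta


# Toy run: time-varying quadratic regularizers and a fixed quadratic loss centred at c.
c = [1.0, -2.0]
T = 2000
regularizers = [
    (lambda theta, t=t: [th / math.sqrt(t) for th in theta]) for t in range(1, T + 1)
]
iterates, _ = online_mirror_descent(
    regularizers,
    observe=lambda t, w: [-(wi - ci) for wi, ci in zip(w, c)],
    dim=2,
)
last_iterate = iterates[-1]
```

In this run the iterates drift towards `c`, which is the minimizer of the fixed loss.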



Standard OMD (see, e.g., the work of Kakade et al. [2009]) uses $f_t = f$ for
all $t$. Note the following remarkable property of Algorithm 1: while
$\theta_t$ moves freely in $X$ as determined by the input sequence $z_t$,
because of (1) the property $w_t \in S$ holds for all $t$.

The following lemma is a generalization of Corollary 4 of Kakade et al. [2009]
and of Corollary 3 of Duchi et al. [2011].



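Lemma 1. Assume OMD is run with functions $f_1, f_2, \dots$ defined on a common
domain $S \subseteq X$ and such that each $f_t$ is $\beta_t$-strongly convex
with respect to the norm $\|\cdot\|_{f_t}$. Then, for any $u \in S$,

\[ \sum_{t=1}^T \langle z_t, u - w_t\rangle \le f_T(u) + \sum_{t=1}^T \left( \frac{\|z_t\|_{f_t,*}^2}{2\beta_t} + f_t^*(\theta_t) - f_{t-1}^*(\theta_t) \right) \]

where we set $f_0^*(0) = 0$. Moreover, for all $t \ge 1$, we have

\[ f_t^*(\theta_t) - f_{t-1}^*(\theta_t) \le f_{t-1}(w_t) - f_t(w_t) . \tag{3} \]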

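Proof. Let $\Delta_t = f_t^*(\theta_{t+1}) - f_{t-1}^*(\theta_t)$. Then

\[ \sum_{t=1}^T \Delta_t = f_T^*(\theta_{T+1}) - f_0^*(\theta_1) = f_T^*(\theta_{T+1}) . \]

Since the functions $f_t^*$ are $\frac{1}{\beta_t}$-strongly smooth with respect
to $\|\cdot\|_{f_t,*}$, and recalling that $\theta_{t+1} = \theta_t + z_t$,

\begin{align*}
\Delta_t &= f_t^*(\theta_{t+1}) - f_t^*(\theta_t) + f_t^*(\theta_t) - f_{t-1}^*(\theta_t) \\
&\le f_t^*(\theta_t) - f_{t-1}^*(\theta_t) + \langle \nabla f_t^*(\theta_t), z_t\rangle + \frac{1}{2\beta_t}\|z_t\|_{f_t,*}^2 \\
&= f_t^*(\theta_t) - f_{t-1}^*(\theta_t) + \langle w_t, z_t\rangle + \frac{1}{2\beta_t}\|z_t\|_{f_t,*}^2
\end{align*}

where we used the definition of $w_t$ in the last step. On the other hand, the
Fenchel-Young inequality implies

\[ \sum_{t=1}^T \Delta_t = f_T^*(\theta_{T+1}) \ge \langle u, \theta_{T+1}\rangle - f_T(u) = \sum_{t=1}^T \langle u, z_t\rangle - f_T(u) . \]

Combining the upper and lower bound on $\Delta_t$ and summing over $t$ we get

\[ \sum_{t=1}^T \langle u, z_t\rangle - f_T(u) \le \sum_{t=1}^T \Delta_t \le \sum_{t=1}^T \left( f_t^*(\theta_t) - f_{t-1}^*(\theta_t) + \langle w_t, z_t\rangle + \frac{1}{2\beta_t}\|z_t\|_{f_t,*}^2 \right) . \]

We now prove the second statement. Recalling again the definition of $w_t$ we
have that (2) implies $f_t^*(\theta_t) = \langle w_t, \theta_t\rangle - f_t(w_t)$.
On the other hand, the Fenchel-Young inequality implies that
$-f_{t-1}^*(\theta_t) \le f_{t-1}(w_t) - \langle w_t, \theta_t\rangle$. Combining
the two we get
$f_t^*(\theta_t) - f_{t-1}^*(\theta_t) \le f_{t-1}(w_t) - f_t(w_t)$, as desired. □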
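The first inequality of Lemma 1 can be checked numerically on a simple instance. The sketch below assumes the regularizers $f_t(w) = \frac{\sqrt{t}}{2}\|w\|_2^2$, which are $\sqrt{t}$-strongly convex with respect to the Euclidean norm, with $f_t^*(\theta) = \|\theta\|_2^2 / (2\sqrt{t})$, and feeds OMD arbitrary random vectors $z_t$. The function name and the random inputs are illustrative assumptions.

```python
import random


def lemma1_sides(T=50, dim=3, seed=0):
    """Return both sides of the first inequality of Lemma 1 for f_t(w) = sqrt(t)/2 ||w||^2."""
    rng = random.Random(seed)
    beta = [float(t) ** 0.5 for t in range(1, T + 1)]        # beta_t = sqrt(t)
    u = [rng.uniform(-1, 1) for _ in range(dim)]             # arbitrary comparator
    theta = [0.0] * dim                                      # theta_1 = 0
    lhs, rhs = 0.0, 0.0
    for t in range(T):
        w = [th / beta[t] for th in theta]                   # w_t = grad f_t^*(theta_t)
        z = [rng.uniform(-1, 1) for _ in range(dim)]         # arbitrary input z_t
        sq_theta = sum(th * th for th in theta)
        conj_t = sq_theta / (2 * beta[t])                    # f_t^*(theta_t)
        # f_{t-1}^*(theta_t); for the first step theta_1 = 0 and f_0^*(0) = 0
        conj_prev = sq_theta / (2 * beta[t - 1]) if t > 0 else 0.0
        lhs += sum(zi * (ui - wi) for zi, ui, wi in zip(z, u, w))
        rhs += sum(zi * zi for zi in z) / (2 * beta[t]) + conj_t - conj_prev
        theta = [th + zi for th, zi in zip(theta, z)]        # theta_{t+1} = theta_t + z_t
    rhs += beta[-1] / 2 * sum(ui * ui for ui in u)           # add f_T(u)
    return lhs, rhs


results = [lemma1_sides(seed=s) for s in range(5)]
```

Because the regularizers increase with $t$, each difference $f_t^*(\theta_t) - f_{t-1}^*(\theta_t)$ is nonpositive here, in agreement with (3).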

Next, we show a general regret bound for Algorithm 1.

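Corollary 1. Let $R : S \to \mathbb{R}$ be a convex function and let
$g_1, g_2, \dots$ be a nondecreasing sequence of convex functions
$g_t : S \to \mathbb{R}$. Fix $\eta > 0$ and assume $f_t = g_t + \eta t R$ are
$\beta_t$-strongly convex w.r.t. $\|\cdot\|_{f_t}$. If OMD is run on the input
sequence $z_t = -\eta \ell_t'$ for some $\ell_t' \in \partial\ell_t(w_t)$, then

\[ \sum_{t=1}^T \big(\ell_t(w_t) + R(w_t)\big) - \sum_{t=1}^T \big(\ell_t(u) + R(u)\big) \le \frac{g_T(u)}{\eta} + \eta \sum_{t=1}^T \frac{\|\ell_t'\|_{f_t,*}^2}{2\beta_t} \tag{4} \]

for all $u \in S$.

Moreover, if $f_t = g\sqrt{t} + \eta t R$ where $g : S \to \mathbb{R}$ is
$\beta$-strongly convex w.r.t. $\|\cdot\|$, then

\[ \sum_{t=1}^T \big(\ell_t(w_t) + R(w_t)\big) - \sum_{t=1}^T \big(\ell_t(u) + R(u)\big) \le \sqrt{T}\left( \frac{g(u)}{\eta} + \frac{\eta}{\beta} \max_{t \le T} \|\ell_t'\|_*^2 \right) \tag{5} \]

for all $u \in S$.

Finally, if $f_t = tR$, where $R$ is $\beta$-strongly convex w.r.t. $\|\cdot\|$,
then

\[ \sum_{t=1}^T \big(\ell_t(w_t) + R(w_t)\big) - \sum_{t=1}^T \big(\ell_t(u) + R(u)\big) \le \frac{\max_{t \le T} \|\ell_t'\|_*^2 \, (1 + \ln T)}{2\beta} \tag{6} \]

for all $u \in S$.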

Proof. By convexity,
$\ell_t(w_t) - \ell_t(u) \le \frac{1}{\eta}\langle z_t, u - w_t\rangle$. Using
Lemma 1 we have,

\[ \sum_{t=1}^T \langle z_t, u - w_t\rangle \le g_T(u) + \eta T R(u) + \eta^2 \sum_{t=1}^T \frac{\|\ell_t'\|_{f_t,*}^2}{2\beta_t} + \eta \sum_{t=1}^T \big( (t-1)R(w_t) - tR(w_t) \big) \]

where we used the fact that the terms $g_{t-1}(w_t) - g_t(w_t)$ are nonpositive
under the hypothesis that the functions $g_t$ are nondecreasing. Reordering
terms we obtain (4). In order to obtain (5) it is sufficient to note that, by
definition of strong convexity, $g\sqrt{t}$ is $\beta\sqrt{t}$-strongly convex
because $g$ is $\beta$-strongly convex, hence $f_t$ is $\beta\sqrt{t}$-strongly
convex too. The elementary inequality
$\sum_{t=1}^T \frac{1}{\sqrt{t}} \le 2\sqrt{T}$ concludes the proof of (5).
Finally, bound (6) is proven by observing that $f_t = tR$ is $\beta t$-strongly
convex because $R$ is $\beta$-strongly convex. The elementary inequality
$\sum_{t=1}^T \frac{1}{t} \le 1 + \ln T$ concludes the proof. □



Note that the regret bounds obtained in Corollary 1 are for the composite
setting, where the algorithm must minimize the sum $\ell_t(\cdot) + R(\cdot)$ of
two functions, where the first one is typically a loss and the other is a
regularization term. Here, the only hypothesis on $R$ is convexity, hence $R$
can be a nondifferentiable function, like $\|\cdot\|_1$ for inducing sparse
solutions. While the composite setting is considered more difficult than the
standard one, and requires specific ad-hoc algorithms [Duchi and Singer, 2009;
Xiao, 2010; Duchi et al., 2010], here we show that this setting can be
efficiently solved using OMD with a specific choice of the time-varying
regularizers. Thus, we recover the results about minimization of strongly
convex and composite loss functions, and adaptive learning rates, in a simple
unified framework. Note that the key result to obtain this generality is (3).
This was missing in previous analyses of OMD.
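To make the composite setting concrete, the sketch below instantiates OMD with $f_t = g\sqrt{t} + \eta t R$ for the assumed choices $g = \frac{1}{2}\|\cdot\|_2^2$ and $R = \lambda\|\cdot\|_1$. For this choice the maximization in (1) separates over coordinates, and $w_t = \nabla f_t^*(\theta_t)$ is a soft-thresholding of $\theta_t$. The square loss, the parameter values, and the function names are illustrative assumptions, not taken from the paper.

```python
import math


def soft_threshold_step(theta, t, eta, lam):
    """w_t = grad f_t^*(theta) for f_t(w) = sqrt(t)/2 ||w||_2^2 + eta * t * lam * ||w||_1."""
    c = eta * t * lam
    return [math.copysign(max(abs(th) - c, 0.0), th) / math.sqrt(t) for th in theta]


def composite_omd(examples, eta=0.5, lam=0.1):
    """OMD for the composite objective l_t(w) + lam * ||w||_1 with l_t(w) = 1/2 (<w, x_t> - y_t)^2."""
    dim = len(examples[0][0])
    theta = [0.0] * dim
    w = [0.0] * dim
    for t, (x, y) in enumerate(examples, start=1):
        w = soft_threshold_step(theta, t, eta, lam)
        err = sum(wi * xi for wi, xi in zip(w, x)) - y               # derivative of the square loss
        theta = [th - eta * err * xi for th, xi in zip(theta, x)]    # z_t = -eta * l'_t
    return w


step = soft_threshold_step([3.0, -0.2, 0.0], t=4, eta=0.5, lam=0.1)
final_w = composite_omd([([1.0, 0.0], 2.0), ([1.0, 0.0], 2.0), ([0.5, 0.0], 1.0)])
```

Coordinates of $\theta_t$ whose magnitude stays below the growing threshold $\eta t \lambda$ are mapped exactly to zero, which is how the 1-norm term induces sparsity without any special-purpose update.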

In particular, a special case of OMD is the Regularized Dual Averaging
framework of Xiao [2010], where the prediction at each step is defined by

\[ w_t = \operatorname*{argmin}_{w} \left( \frac{1}{t-1} \sum_{s=1}^{t-1} \langle w, \ell_s'\rangle + \frac{\beta_{t-1}}{t-1}\, g(w) + R(w) \right) \tag{7} \]

for some $\ell_s' \in \partial\ell_s(w_s)$, $s = 1, \dots, t-1$. Using (1), it is
easy to see that this update is equivalent¹ to

\[ w_t = \nabla f_t^*\left( -\sum_{s=1}^{t-1} \ell_s' \right) \]

where $f_t(w) = \beta_{t-1}\, g(w) + (t-1)R(w)$. The framework of Xiao [2010] has
been extended by Duchi et al. [2010] to allow the strongly convex part of the
regularizer to increase over time. A bound similar to (5) has also been
recently presented by Duchi et al. [2010]. There, a more immediate trade-off
between the current gradient and the Bregman divergence from the new solution
to the previous one is used to update at each time step. However, in both cases
their analysis is not flexible enough to include algorithms whose update does
not use the sub-gradient of the loss function. Examples of such algorithms are
the Vovk-Azoury-Warmuth algorithm of the next section and the online binary
classification algorithms of Section 6.



4 Square Loss 



In this section we recover known regret bounds for online regression with the
square loss via Lemma 1. Throughout this section, $X = \mathbb{R}^d$ and the
inner product $\langle u, x\rangle$ is the standard dot product $u^\top x$. We
set $\ell_t(u) = \frac{1}{2}(y_t - u^\top x_t)^2$ where
$(x_1, y_1), (x_2, y_2), \dots$ is some arbitrary sequence of examples
$(x_t, y_t) \in \mathbb{R}^d \times \mathbb{R}$.

¹ Although Xiao [2010] explicitly mentions that his results cannot be recovered
with the primal-dual proofs, here we prove the contrary.



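First, note that it is possible to specialize OMD to the Vovk-Azoury-Warmuth
algorithm for online regression by setting $z_t = y_t x_t$ and
$f_t(u) = \frac{1}{2} u^\top A_t u$, where $A_0 = a I_d$ and
$A_t = A_{t-1} + x_t x_t^\top$ for $t \ge 1$. Note that $z_t$ is not equal to
minus the gradient of the square loss. The regret bound of this algorithm (see,
e.g., Theorem 11.8 of Cesa-Bianchi and Lugosi [2006]) is recovered from Lemma 1
by noting that $f_t$ is 1-strongly convex with respect to the norm
$\|u\|_{f_t} = \sqrt{u^\top A_t u}$. Hence,

\begin{align*}
R_T(u) &= \sum_{t=1}^T \big( y_t u^\top x_t - y_t w_t^\top x_t \big) - f_T(u) + \frac{a}{2}\|u\|^2 + \frac{1}{2}\sum_{t=1}^T (w_t^\top x_t)^2 \\
&\le f_T(u) + \sum_{t=1}^T \left( \frac{y_t^2}{2}\, x_t^\top A_t^{-1} x_t + f_t^*(\theta_t) - f_{t-1}^*(\theta_t) \right) - f_T(u) + \frac{a}{2}\|u\|^2 + \frac{1}{2}\sum_{t=1}^T (w_t^\top x_t)^2 \\
&\le \frac{a}{2}\|u\|^2 + \frac{Y^2}{2} \sum_{t=1}^T x_t^\top A_t^{-1} x_t
\end{align*}

since
$f_t^*(\theta_t) - f_{t-1}^*(\theta_t) \le f_{t-1}(w_t) - f_t(w_t) = -\frac{1}{2}(w_t^\top x_t)^2$,
and by setting $Y \ge \max_t |y_t|$.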
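The Vovk-Azoury-Warmuth instance can be sketched as follows, under stated assumptions: the dual update is $\theta_{t+1} = \theta_t + y_t x_t$ (the sign for which $w_t = A_t^{-1}\theta_t$ coincides with the forecaster), $A_t^{-1}$ is maintained with the Sherman-Morrison formula instead of an explicit inverse, and the data are random. The helper names are hypothetical.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def matvec(m, v):
    return [dot(row, v) for row in m]


def vaw_run(examples, a=1.0):
    """Return (cumulative square loss of the forecaster, sum_t x_t^T A_t^{-1} x_t)."""
    d = len(examples[0][0])
    inv = [[(1.0 / a if i == j else 0.0) for j in range(d)] for i in range(d)]  # A_0^{-1}
    theta = [0.0] * d
    total_loss, quad = 0.0, 0.0
    for x, y in examples:
        # Sherman-Morrison: inverse of A_t = A_{t-1} + x_t x_t^T from the inverse of A_{t-1}
        ax = matvec(inv, x)
        denom = 1.0 + dot(x, ax)
        inv = [[inv[i][j] - ax[i] * ax[j] / denom for j in range(d)] for i in range(d)]
        w = matvec(inv, theta)                        # w_t = grad f_t^*(theta_t) = A_t^{-1} theta_t
        total_loss += 0.5 * (y - dot(w, x)) ** 2
        quad += dot(x, matvec(inv, x))
        theta = [th + y * xi for th, xi in zip(theta, x)]    # theta_{t+1} = theta_t + y_t x_t
    return total_loss, quad


rng = random.Random(1)
examples = [([rng.uniform(-1, 1) for _ in range(3)], rng.uniform(-1, 1)) for _ in range(100)]
u = [0.3, -0.5, 0.1]                                  # arbitrary comparator
a = 1.0
loss, quad = vaw_run(examples, a)
comparator_loss = sum(0.5 * (y - dot(u, x)) ** 2 for x, y in examples)
Y = max(abs(y) for _, y in examples)
regret_value = loss - comparator_loss
bound = 0.5 * a * dot(u, u) + 0.5 * Y * Y * quad
```

Note that the regularizer $A_t$ used to predict at step $t$ already contains $x_t x_t^\top$, which is what distinguishes this forecaster from online ridge regression.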



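We can also generalize the p-norm LMS algorithm of Kivinen et al. [2006] for
controlling the adaptive filtering regret

\[ R_T^{\mathrm{af}}(u) = \sum_{t=1}^T \big( w_t^\top x_t - u^\top x_t \big)^2 . \]

(Readers interested in the motivations behind the study of this regret are
referred to that paper.) This is achieved by setting
$z_t = (y_t - w_t^\top x_t)\, x_t$ and $f_t(u) = \frac{X_t^2}{\beta} f(u)$ in OMD,
where $f$ is an arbitrary $\beta$-strongly convex function with respect to some
norm $\|\cdot\|$, and $X_t = \max_{s \le t} \|x_s\|_*$. We can then write

\[ \sum_{t=1}^T \langle z_t, u - w_t\rangle = \sum_{t=1}^T \Big( (y_t - w_t^\top x_t)\, u^\top x_t - (y_t - w_t^\top x_t)\, w_t^\top x_t \Big) \le f_T(u) + \frac{1}{2} \sum_{t=1}^T (y_t - w_t^\top x_t)^2 \]

where in the last step we used Lemma 1, the $X_t^2$-strong convexity of $f_t$,
and the fact that $f_t \ge f_{t-1}$. Simplifying the expression via the identity
$(y - \hat{y})(v - \hat{y}) = \frac{1}{2}\big( (y - \hat{y})^2 + (v - \hat{y})^2 - (y - v)^2 \big)$,
applied with $\hat{y} = w_t^\top x_t$ and $v = u^\top x_t$, we obtain the following
adaptive filtering bound

\[ R_T^{\mathrm{af}}(u) \le \frac{2 X_T^2}{\beta} f(u) + \sum_{t=1}^T (y_t - u^\top x_t)^2 . \]

Compared to the bounds of Kivinen et al. [2006], our algorithm inherits the
ability to adapt to the maximum norm of $x_t$ without any prior knowledge.
Moreover, instead of using a decreasing learning rate here we use an increasing
regularizer.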
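A Euclidean instance of the adaptive filtering construction can be sketched as follows. The assumptions are: $f = \frac{1}{2}\|\cdot\|_2^2$ (so $\beta = 1$ and $w_t = \theta_t / X_t^2$), the update $\theta_{t+1} = \theta_t + (y_t - w_t^\top x_t)\, x_t$, and synthetic noisy data generated from a hypothetical target `u`. The quantity compared against is $X_T^2\|u\|_2^2 + \sum_t (y_t - u^\top x_t)^2$, i.e. the adaptive filtering bound with this choice of $f$.

```python
import random


def adaptive_lms(examples):
    """OMD with f_t = X_t^2 * (1/2)||.||_2^2; returns the predictions and the final X_T^2."""
    dim = len(examples[0][0])
    theta = [0.0] * dim
    x_max_sq = 0.0
    preds = []
    for x, y in examples:
        # X_t^2 = max_{s <= t} ||x_s||_2^2 (x_t is observed before predicting)
        x_max_sq = max(x_max_sq, sum(xi * xi for xi in x))
        w = [th / x_max_sq for th in theta] if x_max_sq > 0.0 else list(theta)
        pred = sum(wi * xi for wi, xi in zip(w, x))
        preds.append(pred)
        theta = [th + (y - pred) * xi for th, xi in zip(theta, x)]   # add z_t
    return preds, x_max_sq


rng = random.Random(0)
u = [0.5, -1.0]                                      # hypothetical target filter
examples = []
for _ in range(200):
    x = [rng.uniform(-2, 2), rng.uniform(-2, 2)]
    clean = sum(ui * xi for ui, xi in zip(u, x))
    examples.append((x, clean + rng.gauss(0.0, 0.1)))  # noisy observation y_t

preds, x_max_sq = adaptive_lms(examples)
clean_outputs = [sum(ui * xi for ui, xi in zip(u, x)) for x, _ in examples]
filtering_regret = sum((p - c) ** 2 for p, c in zip(preds, clean_outputs))
noise_energy = sum((y - c) ** 2 for (_, y), c in zip(examples, clean_outputs))
bound = x_max_sq * sum(ui * ui for ui in u) + noise_energy
```

No learning rate is tuned here: the scale of the inputs enters only through the running maximum `x_max_sq`, which plays the role of the increasing regularizer.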



5 A new algorithm for online regression 

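In this section we show the full power of our framework by introducing a new
time-varying regularizer $f_t$ generalizing the squared $q$-norm. Then, we
derive the corresponding regret bound. As in the previous section, let
$X = \mathbb{R}^d$ and let the inner product $\langle u, x\rangle$ be the
standard dot product $u^\top x$.

Given $(b_1, \dots, b_d) \in \mathbb{R}_+^d$ and $q \in (1, 2]$ let the weighted
$q$-norm of $w \in \mathbb{R}^d$ be

\[ \left( \sum_{i=1}^d |w_i|^q\, b_i \right)^{1/q} . \]

Define the corresponding regularization function by

\[ f(w) = \frac{1}{2(q-1)} \left( \sum_{i=1}^d |w_i|^q\, b_i \right)^{2/q} . \]

This function has the following properties (proof in appendix).

Lemma 2. The Fenchel conjugate of $f$ is

\[ f^*(\theta) = \frac{1}{2(p-1)} \left( \sum_{i=1}^d |\theta_i|^p\, b_i^{1-p} \right)^{2/p} \quad \text{for } p = \frac{q}{q-1} . \tag{8} \]

Moreover, the function $f(w)$ is 1-strongly convex with respect to the norm

\[ \left( \sum_{i=1}^d |w_i|^q\, b_i \right)^{1/q} \]

whose dual norm is defined by

\[ \left( \sum_{i=1}^d |\theta_i|^p\, b_i^{1-p} \right)^{1/p} . \]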


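We can now prove the following regret bound for linear regression with
absolute loss.

Corollary 2. Let

\[ f_t(u) = \frac{\sqrt{e\,t}}{2(q_t - 1)} \left( \sum_{i=1}^d \big( |u_i|\, b_{t,i} \big)^{q_t} \right)^{2/q_t} \]

where $b_{t,i} = \max_{s=1,\dots,t} |x_{s,i}|$, and let

\[ q_t = \left( 1 - \frac{1}{2 \ln \max_{s \le t} \|x_s\|_0} \right)^{-1} . \]

If OMD is run using regularizers $f_t$ on the input sequence
$z_t = -\eta \ell_t'$, where $\ell_t' \in \partial\ell_t(w_t)$ for
$\ell_t(w) = |w^\top x_t - y_t|$ and $\eta > 0$, then

\[ \sum_{t=1}^T |w_t^\top x_t - y_t| - \sum_{t=1}^T |u^\top x_t - y_t| \le \sqrt{e\,T} \left( \frac{2 \ln \max_{t \le T} \|x_t\|_0 - 1}{2\eta} \Big( \sum_{i=1}^d |u_i|\, B_{T,i} \Big)^2 + \eta \right) \]

for any $u \in \mathbb{R}^d$, where $B_{T,i} = \max_{t \le T} |x_{t,i}|$.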

This bound has the interesting property of being invariant with respect to
arbitrary scaling of individual coordinates of the data points $x_t$. This is
unlike running standard OMD with non-adaptive regularizers, which gives bounds
of the form $\|u\| \max_t \|x_t\|_* \sqrt{T}$. In particular, by an appropriate
tuning of $\eta$ the regret in Corollary 2 is bounded by a quantity of the order
of

\[ \left( \sum_{i=1}^d |u_i| \max_t |x_{t,i}| \right) \sqrt{T \ln d} . \]

When the good $u$ are sparse, that is, when $\|u\|_1$ is small, this is always
better than running standard OMD with a non-weighted $q$-norm regularizer,
which for $q \to 1$ (the best choice for the sparse $u$ case) gives bounds of
the form

\[ \|u\|_1 \max_t \|x_t\|_\infty \sqrt{T \ln d} . \]

Indeed, we have

\[ \sum_{i=1}^d |u_i| \max_t |x_{t,i}| \le \sum_{i=1}^d |u_i| \max_t \max_j |x_{t,j}| = \|u\|_1 \max_t \|x_t\|_\infty . \]

Similar regularization functions are studied by Grave et al. [2011], although
in a different context.



6 Binary classification: aggressive and diagonal updates

In this section we show that several known algorithms for online binary
classification are special cases of OMD. These algorithms include p-norm
Perceptron [Gentile, 2003], Passive-Aggressive [Crammer et al., 2006],
second-order Perceptron [Cesa-Bianchi et al., 2005], and AROW [Crammer et al.,
2009]. Besides recovering all previously known mistake bounds, we also show new
bounds for Passive-Aggressive and for AROW with diagonal updates.

Fix any Euclidean space $X$ with inner product $\langle\cdot,\cdot\rangle$.
Given a fixed but unknown sequence $(x_1, y_1), (x_2, y_2), \dots$ of examples
$(x_t, y_t) \in X \times \{-1, +1\}$, let
$\ell_t(w) = \ell(\langle w, x_t\rangle, y_t)$ be the hinge loss
$[1 - y_t\langle w, x_t\rangle]_+$. It is easy to verify that the hinge loss
satisfies the following condition:

\[ \text{if } \ell_t(w) > 0 \text{ then } \ell_t(u) \ge 1 + \langle u, \ell_t'\rangle \text{ for all } u, w \in \mathbb{R}^d \text{ with } \ell_t' \in \partial\ell_t(w) . \tag{9} \]

Note that when $\ell_t(w) > 0$ the subgradient notation is redundant, as
$\partial\ell_t(w)$ is the singleton $\{\nabla\ell_t(w)\}$. We apply the OMD
algorithm to online binary classification by setting $z_t = -\eta_t \ell_t'$ if
$\ell_t(w_t) > 0$, and $z_t = 0$ otherwise.

In the following, when $T$ is understood from the context, we denote by
$\mathcal{M}$ the set of steps $t$ on which the algorithm made a mistake,
$\hat{y}_t \ne y_t$, where $\hat{y}_t$ is the label predicted at step $t$.
Similarly, we denote by $\mathcal{U}$ the set of margin error steps; that is,
steps where $\hat{y}_t = y_t$ but $\ell_t(w_t) > 0$. Following a standard
terminology, we call conservative or passive an algorithm that updates its
classifier only on mistake steps, and aggressive an algorithm that updates its
classifier both on mistake and margin-error steps.
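The difference between conservative and aggressive updates can be sketched as follows. The example assumes the simplest regularizer $f_t = \frac{1}{2}\|\cdot\|_2^2$ (so that $w_t = \theta_t$), a learning rate $\eta_t = 1$ on every update, and the convention that a zero score counts as a mistake. The function name and the toy data are hypothetical.

```python
def omd_classifier(examples, aggressive):
    """OMD with f_t = 1/2 ||.||_2^2, so w_t = theta_t, and eta_t = 1 on every update.

    Returns the final weights, the number of mistake steps (set M)
    and the number of margin error steps (set U).
    """
    dim = len(examples[0][0])
    theta = [0.0] * dim
    mistakes, margin_errors = 0, 0
    for x, y in examples:
        margin = y * sum(th * xi for th, xi in zip(theta, x))
        if margin <= 0:                 # mistake step: the predicted label is wrong
            mistakes += 1
            update = True
        elif margin < 1:                # margin error: correct label but hinge loss > 0
            margin_errors += 1
            update = aggressive         # only aggressive algorithms update here
        else:
            update = False
        if update:                      # z_t = -l'_t = y_t x_t for the hinge loss
            theta = [th + y * xi for th, xi in zip(theta, x)]
    return theta, mistakes, margin_errors


examples = [([0.5, 0.0], 1), ([0.5, 0.0], 1), ([0.5, 0.0], 1)]
conservative_run = omd_classifier(examples, aggressive=False)
aggressive_run = omd_classifier(examples, aggressive=True)
```

Both runs make one mistake and two margin errors on this toy sequence, but only the aggressive run keeps moving the weight vector on the margin-error steps.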

6.1 First-order algorithms 

If we run OMD in conservative mode, and let
$f_t = f = \frac{1}{2}\|\cdot\|_p^2$ for $1 < p \le 2$, then we recover the
p-norm Perceptron of Gentile [2003]. We now show how to use our framework to
generalize and improve previous analyses for binary classification algorithms
that use aggressive updates.

Corollary 3. Assume OMD is run with $f_t = f$ where $f$, with domain $X$, is
$\beta$-strongly convex with respect to the norm $\|\cdot\|$ and satisfies
$f(\lambda u) \le \lambda^2 f(u)$ for all $\lambda \in \mathbb{R}$ and all
$u \in X$. Further assume the input sequence is $z_t = \eta_t y_t x_t$, for some
$0 \le \eta_t \le 1$ such that $y_t\langle w_t, x_t\rangle \le 0$ implies
$\eta_t = 1$. Then, for all $T \ge 1$ and for all $u \in X$,

\[ M \le L(u) + D + \frac{2}{\beta} f(u) X_T^2 + X_T \sqrt{\frac{2}{\beta} f(u) L(u)} \]

where $M = |\mathcal{M}|$, $X_t = \max_{s \le t} \|x_s\|_*$,

\[ L(u) = \sum_{t=1}^T \big[ 1 - y_t\langle u, x_t\rangle \big]_+ \quad \text{and} \quad D = \sum_{t \in \mathcal{U}} \eta_t \left( \frac{\eta_t \|x_t\|_*^2 + 2\beta\, y_t\langle w_t, x_t\rangle}{X_t^2} - 2 \right) . \]

For the conservative p-norm Perceptron, we have $\mathcal{U} = \emptyset$,
$\|\cdot\|_* = \|\cdot\|_q$ where $q = \frac{p}{p-1}$, and $\beta = p - 1$
because $\frac{1}{2}\|\cdot\|_p^2$ is $(p-1)$-strongly convex with respect to
$\|\cdot\|_p$ for $1 < p \le 2$, see Lemma 17 of Shalev-Shwartz [2007]. We
therefore recover the mistake bound of Gentile [2003].

The term $D$ in the bound of Corollary 3 can be negative. We can minimize it,
subject to $0 \le \eta_t \le 1$, by setting

\[ \eta_t = \max\left\{ \min\left\{ \frac{X_t^2 - \beta\, y_t\langle w_t, x_t\rangle}{\|x_t\|_*^2},\ 1 \right\},\ 0 \right\} . \]

This tuning of $\eta_t$ is quite similar to that of the Passive-Aggressive
algorithm (type I) of Crammer et al. [2006]. In fact for
$f_t = f = \frac{1}{2}\|\cdot\|_2^2$ we would have

\[ \eta_t = \max\left\{ \min\left\{ \frac{X_t^2 - y_t\langle w_t, x_t\rangle}{\|x_t\|^2},\ 1 \right\},\ 0 \right\} \]

while the update rule for PA-I is

\[ \eta_t = \max\left\{ \min\left\{ \frac{1 - y_t\langle w_t, x_t\rangle}{\|x_t\|^2},\ 1 \right\},\ 0 \right\} . \]

The mistake bound of Corollary 3 is however better than the aggressive bounds
for PA-I of Crammer et al. [2006] and Shalev-Shwartz [2007]. Indeed, while the
PA-I bounds are generally worse than the Perceptron mistake bound

\[ M \le L(u) + \big( \|u\| X_T \big)^2 + \|u\| X_T \sqrt{L(u)} , \tag{10} \]

as discussed by Crammer et al. [2006], our bound is better as soon as $D < 0$.
Hence, it can be viewed as the first theoretical evidence in support of
aggressive updates.
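The two learning-rate rules compared above can be written side by side. In this sketch `margin` stands for $y_t\langle w_t, x_t\rangle$, `x_norm_sq` for the squared (dual) norm of $x_t$, and `x_max_sq` for $X_t^2$; the function names are hypothetical, and the PA-I rule is written with the clipping constant equal to 1, as in the comparison above.

```python
def omd_rate(margin, x_norm_sq, x_max_sq, beta=1.0):
    """eta_t = max(min((X_t^2 - beta * y_t<w_t, x_t>) / ||x_t||_*^2, 1), 0)."""
    return max(min((x_max_sq - beta * margin) / x_norm_sq, 1.0), 0.0)


def pa1_rate(margin, x_norm_sq):
    """PA-I rule: max(min((1 - y_t<w_t, x_t>) / ||x_t||^2, 1), 0)."""
    return max(min((1.0 - margin) / x_norm_sq, 1.0), 0.0)


# With unit-norm inputs (X_t = ||x_t|| = 1) the two rules coincide.
same_scale = (omd_rate(0.5, 1.0, 1.0), pa1_rate(0.5, 1.0))
# With larger inputs they differ: the OMD rule scales with X_t^2 instead of the constant 1.
larger_inputs = (omd_rate(0.5, 4.0, 4.0), pa1_rate(0.5, 4.0))
# On a mistake step (margin <= 0) the OMD rule always returns 1, as Corollary 3 requires.
mistake_step = omd_rate(-2.0, 1.0, 1.0)
```

The only difference between the two rules is the constant in the numerator, which the OMD tuning replaces with the data-dependent quantity $X_t^2$.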



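Proof (of Corollary 3). Using (16) in Lemma 3 with the assumption $\eta_t = 1$
when $t \in \mathcal{M}$, we get

\begin{align*}
M &\le L(u) + \sqrt{2 f(u)} \sqrt{ \sum_{t \in \mathcal{M}} \frac{\|x_t\|_*^2}{\beta} + \sum_{t \in \mathcal{U}} \left( \frac{\eta_t^2}{\beta}\|x_t\|_*^2 + 2\eta_t\, y_t\langle w_t, x_t\rangle \right) } - \sum_{t \in \mathcal{U}} \eta_t \\
&\le L(u) + X_T \sqrt{\frac{2}{\beta} f(u)} \sqrt{ M + \sum_{t \in \mathcal{U}} \frac{\eta_t^2 \|x_t\|_*^2 + 2\beta\eta_t\, y_t\langle w_t, x_t\rangle}{X_t^2} } - \sum_{t \in \mathcal{U}} \eta_t
\end{align*}

where we have used the fact that $X_t \le X_T$ for all $t \le T$. Solving for
$M$ we get

\[ M \le L(u) + \frac{1}{\beta} f(u) X_T^2 + X_T \sqrt{\frac{2}{\beta} f(u)} \sqrt{ \frac{1}{2\beta} X_T^2 f(u) + L(u) + D' } - \sum_{t \in \mathcal{U}} \eta_t , \tag{11} \]

with $\frac{1}{2\beta} X_T^2 f(u) + L(u) + D' \ge 0$, and

\[ D' = \sum_{t \in \mathcal{U}} \left( \frac{\eta_t^2 \|x_t\|_*^2 + 2\beta\eta_t\, y_t\langle w_t, x_t\rangle}{X_t^2} - \eta_t \right) . \]

We further upper bound the right-hand side of (11) using the elementary
inequality $\sqrt{a + b} \le \sqrt{a} + \frac{b}{2\sqrt{a}}$ for all $a > 0$ and
$b \ge -a$. This gives

\begin{align*}
M &\le L(u) + \frac{1}{\beta} f(u) X_T^2 + X_T \sqrt{\frac{2}{\beta} f(u)} \sqrt{ \frac{1}{2\beta} X_T^2 f(u) + L(u) } + \frac{ X_T D' \sqrt{\frac{2}{\beta} f(u)} }{ 2 \sqrt{ \frac{1}{2\beta} X_T^2 f(u) + L(u) } } - \sum_{t \in \mathcal{U}} \eta_t \\
&\le L(u) + \frac{1}{\beta} f(u) X_T^2 + X_T \sqrt{\frac{2}{\beta} f(u)} \sqrt{ \frac{1}{2\beta} X_T^2 f(u) + L(u) } + D' - \sum_{t \in \mathcal{U}} \eta_t .
\end{align*}

Applying the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ and rearranging
gives the desired bound. □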

6.2 Second-order algorithms 

We now apply our framework to second-order algorithms for binary
classification. Here, we let $X = \mathbb{R}^d$ and the inner product
$\langle u, x\rangle$ be the standard dot product $u^\top x$.

Second-order algorithms for binary classification are online variants of Ridge
regression. Recall that the Ridge regression linear predictor is defined by

\[ w_{t+1} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \left( \sum_{s=1}^t (w^\top x_s - y_s)^2 + \|w\|^2 \right) . \]

The closed-form expression for $w_{t+1}$, which involves the design matrix
$S_t = [x_1, \dots, x_t]$ and the label vector
$\boldsymbol{y}_t = (y_1, \dots, y_t)$, is given by
$w_{t+1} = (I + S_t S_t^\top)^{-1} S_t \boldsymbol{y}_t$. The second-order
Perceptron (see below) uses this weight $w_{t+1}$, but $S_t$ and
$\boldsymbol{y}_t$ only contain the examples $(x_s, y_s)$ on which a mistake
occurred. In this sense, it is an online variant of the Ridge regression
algorithm.

In practice, second-order algorithms typically perform better than their
first-order counterparts, such as the algorithms in the Perceptron family.
There are two basic second-order algorithms: the second-order Perceptron of
Cesa-Bianchi et al. [2005] and the AROW algorithm of Crammer et al. [2009]. We
show that both of them are instances of OMD and recover their mistake bounds as
special cases of our analysis.

Let $z_t = -\eta_t \ell_t'(w_t)$, and $f_t(x) = \frac{1}{2} x^\top A_t x$, where
$\ell_t$ is the hinge loss, $A_0 = I$, and
$A_t = A_{t-1} + \frac{1}{r} x_t x_t^\top$ with $r > 0$. Each function $f_t$ is
1-strongly convex with respect to the norm $\|x\|_{f_t} = \sqrt{x^\top A_t x}$
with dual norm $\|x\|_{f_t,*} = \sqrt{x^\top A_t^{-1} x}$. The dual function of
$f_t(x)$ is $f_t^*(x) = \frac{1}{2} x^\top A_t^{-1} x$. Now, the conservative
version of OMD run with $f_t$ chosen as above is the second-order Perceptron.
The aggressive version corresponds instead to AROW with a minor difference.
Indeed, in this case the prediction of OMD is the sign of
$w_t^\top x_t = m_t \frac{r}{r + \chi_t}$, where we use the notation
$\chi_t = x_t^\top A_{t-1}^{-1} x_t$ and $m_t = x_t^\top A_{t-1}^{-1} \theta_t$.
On the other hand, AROW simply predicts using the sign of $m_t$. The sign of
the predictions is the same, but OMD updates when
$y_t m_t \frac{r}{r + \chi_t} < 1$ while AROW updates when $y_t m_t < 1$.
Typically, for large $t$ the value of $\chi_t$ is small, and thus the two update
rules coincide in practice.
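The relation between the OMD margin $w_t^\top x_t$ and the AROW margin $m_t$ can be checked numerically. The sketch below assumes $A_0 = I$, a rank-one update $A_t = A_{t-1} + \frac{1}{r} x_t x_t^\top$ at every step, and a fixed dual vector `theta` (the identity holds for any $\theta_t$); the inverse is maintained with the Sherman-Morrison formula. Names and data are illustrative.

```python
def second_order_margins(xs, theta, r=1.0):
    """For the last vector in xs, return (w_t^T x_t computed via A_t^{-1},
    the closed form m_t * r / (r + chi_t))."""
    d = len(theta)
    inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]     # A_0^{-1} = I
    direct, closed_form = 0.0, 0.0
    for x in xs:
        ax = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]     # A_{t-1}^{-1} x_t
        chi = sum(xi * ai for xi, ai in zip(x, ax))                          # chi_t
        m = sum(ai * th for ai, th in zip(ax, theta))                        # m_t (inverse is symmetric)
        # Sherman-Morrison for A_t = A_{t-1} + x_t x_t^T / r
        inv = [[inv[i][j] - ax[i] * ax[j] / (r + chi) for j in range(d)] for i in range(d)]
        w = [sum(inv[i][j] * theta[j] for j in range(d)) for i in range(d)]  # w_t = A_t^{-1} theta
        direct = sum(wi * xi for wi, xi in zip(w, x))
        closed_form = m * r / (r + chi)
    return direct, closed_form


xs = [[1.0, 0.5], [-0.3, 2.0], [0.7, -1.1]]
theta = [0.4, -0.9]
direct, closed_form = second_order_margins(xs, theta, r=2.0)
```

Since $\frac{r}{r + \chi_t}$ is positive, the two margins always have the same sign, so the predicted labels agree; only the update condition differs.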

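To derive a mistake bound for OMD run with $f_t(x) = \frac{1}{2} x^\top A_t x$,
first observe that using the Woodbury identity we have

\[ f_t^*(\theta_t) - f_{t-1}^*(\theta_t) = -\frac{(x_t^\top A_{t-1}^{-1} \theta_t)^2}{2(r + x_t^\top A_{t-1}^{-1} x_t)} = -\frac{m_t^2}{2(r + \chi_t)} . \]

Hence, using (16) in Lemma 3 and setting $\eta_t = 1$, we obtain

\begin{align*}
M + U &\le L(u) + \sqrt{u^\top A_T u} \sqrt{ \sum_{t \in \mathcal{M} \cup \mathcal{U}} \left( x_t^\top A_t^{-1} x_t + 2 y_t w_t^\top x_t - \frac{m_t^2}{r + \chi_t} \right) } \\
&\le L(u) + \sqrt{ \|u\|^2 + \frac{1}{r} \sum_{t \in \mathcal{M} \cup \mathcal{U}} (u^\top x_t)^2 } \sqrt{ r \ln |A_T| + \sum_{t \in \mathcal{M} \cup \mathcal{U}} \left( 2 y_t w_t^\top x_t - \frac{m_t^2}{r + \chi_t} \right) } \\
&= L(u) + \sqrt{ r\|u\|^2 + \sum_{t \in \mathcal{M} \cup \mathcal{U}} (u^\top x_t)^2 } \sqrt{ \ln |A_T| + \sum_{t \in \mathcal{M} \cup \mathcal{U}} \frac{m_t (2 r y_t - m_t)}{r (r + \chi_t)} }
\end{align*}

for all $u \in X$, where

\[ L(u) = \sum_{t=1}^T \big[ 1 - y_t\langle u, x_t\rangle \big]_+ . \]

This bound improves slightly over the known bound for AROW in the last sum in
the square root. In fact in AROW we have the term $U$, while here we have

\[ \sum_{t \in \mathcal{M} \cup \mathcal{U}} \frac{m_t (2 r y_t - m_t)}{r (r + \chi_t)} \le \sum_{t \in \mathcal{U}} \frac{m_t (2 r y_t - m_t)}{r (r + \chi_t)} \le \sum_{t \in \mathcal{U}} \frac{r^2}{r (r + \chi_t)} \le U . \tag{12} \]

In the conservative case, when $\mathcal{U} = \emptyset$, the bound specializes
to the standard second-order Perceptron bound.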

6.3 Diagonal updates 

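AROW and the second-order Perceptron can be run more efficiently using
diagonal matrices. In this case, each update takes time linear in $d$. We now
use Corollary 3 to prove a mistake bound for the diagonal version of the
second-order Perceptron. Let $D_t = \mathrm{diag}\{A_t\}$ be the diagonal matrix
that agrees with $A_t$ on the diagonal, where $A_t$ is defined as before, and
let $f_t(x) = \frac{1}{2} x^\top D_t x$. Setting $\eta_t = 1$, using the second
bound of Lemma 3 and Lemma 4, we have²

\begin{align*}
M + U &\le L(u) + \sqrt{ u^\top D_T u \left( r \sum_{i=1}^d \ln\Big( \frac{1}{r} \sum_{t \in \mathcal{M} \cup \mathcal{U}} x_{t,i}^2 + 1 \Big) + 2U \right) } \\
&= L(u) + \sqrt{ \|u\|^2 + \frac{1}{r} \sum_{i=1}^d u_i^2 \sum_{t \in \mathcal{M} \cup \mathcal{U}} x_{t,i}^2 } \sqrt{ r \sum_{i=1}^d \ln\Big( \frac{1}{r} \sum_{t \in \mathcal{M} \cup \mathcal{U}} x_{t,i}^2 + 1 \Big) + 2U } . \tag{13}
\end{align*}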



This allows us to theoretically analyze the cases where this algorithm could be
advantageous. In particular, features of NLP data are typically binary, and it
is often the case that most of the features are zero most of the time. On the
other hand, these "rare" features are usually the most informative ones; see,
e.g., the discussion of Dredze et al. [2008], Crammer et al. [2012].

Figure 1: Evidence of heavy tails for NLP Data. The plots show the number of
words vs. the word rank on two sentiment data sets.



Figure 1 shows the number of times each feature (word) appears in two
sentiment datasets vs. the word rank. Clearly, there are a few very frequent
words and many rare words. These exact properties originally motivated the CW
and AROW algorithms, and now our analysis provides a theoretical justification.
Concretely, the above considerations support the assumption that the optimal
hyperplane $u$ satisfies

\[ \sum_{i=1}^d u_i^2 \sum_{t \in \mathcal{M} \cup \mathcal{U}} x_{t,i}^2 \approx \sum_{i \in I} u_i^2 \sum_{t \in \mathcal{M} \cup \mathcal{U}} x_{t,i}^2 \le s \sum_{i \in I} u_i^2 \approx s \|u\|^2 \]

where $I$ is the set of informative and rare features, and $s$ is the maximum
number of times these features appear in the sequence. Running the diagonal
version of the second-order Perceptron so that $\mathcal{U} = \emptyset$, and
assuming that

\[ \sum_{i=1}^d u_i^2 \sum_{t \in \mathcal{M} \cup \mathcal{U}} x_{t,i}^2 \le s \|u\|^2 \tag{14} \]

the last term in the mistake bound (13) can be re-written as

\[ \sqrt{ \|u\|^2 + \frac{1}{r} \sum_{i=1}^d u_i^2 \sum_{t \in \mathcal{M}} x_{t,i}^2 } \sqrt{ r \sum_{i=1}^d \ln\Big( \frac{1}{r} \sum_{t \in \mathcal{M}} x_{t,i}^2 + 1 \Big) } \le \|u\| \sqrt{ (r + s)\, d \ln\Big( \frac{X_T^2 M}{d r} + 1 \Big) } \]

where we calculated the maximum of the sum, given the constraint

\[ \sum_{i=1}^d \sum_{t \in \mathcal{M}} x_{t,i}^2 \le X_T^2 M . \]

² We did not optimize the constant multiplying $U$ in the bound.

We can now use Corollary 5 in the appendix to obtain

M < L{u) + \\u 



•\ 



(,. + ,,)<iln^2WBl±ia+2i(„)--^^ 



dr^ dr 



Hence, when the hypothesis (14) is verified, the number of mistakes of the
diagonal version of AROW depends on $\sqrt{\ln L(u)}$ rather than on
$\sqrt{L(u)}$.

Diagonal updates for online convex optimization were also proposed and
analyzed in Duchi et al. [2011], McMahan and Streeter [2010]. When instantiated
to the binary classification setting studied in this section, their analysis
delivers regret bounds which are not comparable to ours.

7 Conclusions 

We proposed a framework for online convex optimization combining online mirror
descent with time-varying regularizers. This allowed us to view second-order
algorithms (such as the Vovk-Azoury-Warmuth forecaster, the second-order
Perceptron, and the AROW algorithm) and algorithms for composite losses as
special cases of mirror descent. Our analysis also captures second-order
variants that only employ the diagonal elements of the second-order information
matrix, a result which was not within reach of the previous techniques.

Within our framework, we also derived and analyzed a new regularizer based 
on an adaptive weighted version of the p-norm Perceptron. In the case of sparse 
targets, the corresponding instance of OMD achieves a performance bound bet- 
ter than that of OMD with 1-norm regularization. 

We also improved previous bounds for existing first-order algorithms. For 
example, we were able to formally explain the phenomenon according to which 
aggressive algorithms typically exhibit better empirical performance than their 
conservative counterparts. Specifically, our refined analysis provides a bound 
for Passive-Aggressive (PA-I) that is never worse (and sometimes better) than 
the Perceptron bound. 

One interesting direction to pursue is the derivation and analysis of algo- 
rithms based on time-varying versions of the entropic regularizers used by the 
EG and Winnow algorithms. More generally, it would be useful to devise a
more systematic approach to the design of adaptive regularizers enjoying a
given set of desired properties. This should help in obtaining more examples of 
adaptation mechanisms that are not based on second-order information. 
Technical lemmas

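Proof (of Lemma 2). The Fenchel conjugate of $f$ is $f^*(\theta) = \sup_v \left(v^\top\theta - f(v)\right)$.

Set $w$ equal to the gradient of $\frac{1}{2(p-1)}\left(\sum_{i=1}^d |\theta_i|^p b_i^{1-p}\right)^{2/p}$ with respect to $\theta$.
Easy calculations show that

$$w^\top\theta - f(w) = \frac{1}{2(p-1)}\left(\sum_{i=1}^d |\theta_i|^p b_i^{1-p}\right)^{2/p} .$$

We now show that this quantity is indeed $\sup_v \left(v^\top\theta - f(v)\right)$. Pick any $v \in \mathbb{R}^d$.
Applying Hölder's inequality to the vectors $(v_1 b_1^{1/q}, \dots, v_d b_d^{1/q})$ and
$(\theta_1 b_1^{-1/q}, \dots, \theta_d b_d^{-1/q})$ we get

$$v^\top\theta \le \left(\sum_{i=1}^d |v_i|^q b_i\right)^{1/q} \left(\sum_{i=1}^d |\theta_i|^p b_i^{-p/q}\right)^{1/p} = \left(\sum_{i=1}^d |v_i|^q b_i\right)^{1/q} \left(\sum_{i=1}^d |\theta_i|^p b_i^{1-p}\right)^{1/p} .$$

Hence

$$v^\top\theta - f(v) \le \left(\sum_{i=1}^d |v_i|^q b_i\right)^{1/q} \left(\sum_{i=1}^d |\theta_i|^p b_i^{1-p}\right)^{1/p} - \frac{1}{2(q-1)}\left(\sum_{i=1}^d |v_i|^q b_i\right)^{2/q} .$$

The right-hand side is a quadratic function in $\left(\sum_{i=1}^d |v_i|^q b_i\right)^{1/q}$. If we maximize
it, we obtain

$$v^\top\theta - f(v) \le \frac{q-1}{2}\left(\sum_{i=1}^d |\theta_i|^p b_i^{1-p}\right)^{2/p} = \frac{1}{2(p-1)}\left(\sum_{i=1}^d |\theta_i|^p b_i^{1-p}\right)^{2/p} ,$$

which concludes the proof for $f^*$.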

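In order to show the second part, we follow Lemma 17 of Shalev-Shwartz (2007)
and prove that $\left(\sum_{i=1}^d |x_i|^q b_i\right)^{2/q} \le x^\top \nabla^2 f(w)\, x$. Define
$\Psi(a) = \frac{a^{2/q}}{2(q-1)}$ and $\phi(a) = |a|^q$, hence $f(w) = \Psi\left(\sum_{i=1}^d b_i \phi(w_i)\right)$. Clearly

$$\Psi'(a) = \frac{a^{2/q-1}}{q(q-1)} \quad\text{and}\quad \Psi''(a) = \frac{1}{q(q-1)}\left(\frac{2}{q}-1\right) a^{2/q-2} .$$

Moreover, $\phi'(a) = q\,\mathrm{sign}(a)|a|^{q-1}$ and $\phi''(a) = q(q-1)|a|^{q-2}$. The $(i,j)$ element
of $\nabla^2 f(w)$ for $i \neq j$ is

$$\Psi''\left(\sum_{k=1}^d b_k\phi(w_k)\right) b_i b_j\, \phi'(w_i)\phi'(w_j) .$$

The diagonal elements of $\nabla^2 f(w)$ are

$$\Psi''\left(\sum_{k=1}^d b_k\phi(w_k)\right) b_i^2 \left(\phi'(w_i)\right)^2 + \Psi'\left(\sum_{k=1}^d b_k\phi(w_k)\right) b_i\, \phi''(w_i) .$$

Thus we have

$$x^\top \nabla^2 f(w)\, x = \Psi''\left(\sum_{k=1}^d b_k\phi(w_k)\right) \left(\sum_{i=1}^d b_i \phi'(w_i) x_i\right)^2 + \Psi'\left(\sum_{k=1}^d b_k\phi(w_k)\right) \sum_{i=1}^d b_i \phi''(w_i) x_i^2 .$$

The first term is non-negative since $q \in (1,2)$. Writing the second term explicitly
we have

$$\left(\sum_{k=1}^d b_k |w_k|^q\right)^{2/q-1} \sum_{i=1}^d b_i |w_i|^{q-2} x_i^2 .$$

We now lower bound this quantity using Hölder's inequality. Let $y_i = b_i^{\gamma} |w_i|^{(2-q)q/2}$
for $\gamma = (2-q)/2$. We have

$$\begin{aligned}
\left(\sum_{i=1}^d b_i |x_i|^q\right)^{2/q}
&= \left(\sum_{i=1}^d y_i \frac{b_i |x_i|^q}{y_i}\right)^{2/q} \\
&\le \left(\left(\sum_{i=1}^d y_i^{2/(2-q)}\right)^{(2-q)/2} \left(\sum_{i=1}^d \frac{b_i^{2/q} x_i^2}{y_i^{2/q}}\right)^{q/2}\right)^{2/q} \\
&= \left(\sum_{i=1}^d y_i^{2/(2-q)}\right)^{(2-q)/q} \left(\sum_{i=1}^d \frac{b_i^{2/q} x_i^2}{y_i^{2/q}}\right) \\
&= \left(\sum_{i=1}^d b_i |w_i|^q\right)^{(2-q)/q} \left(\sum_{i=1}^d b_i^{2/q}\, b_i^{-(2-q)/q}\, x_i^2 |w_i|^{q-2}\right) \\
&= \left(\sum_{i=1}^d b_i |w_i|^q\right)^{(2-q)/q} \left(\sum_{i=1}^d b_i x_i^2 |w_i|^{q-2}\right) .
\end{aligned}$$

We just showed that

$$x^\top \nabla^2 f(w)\, x \ge \left(\sum_{k=1}^d b_k |w_k|^q\right)^{2/q-1} \sum_{i=1}^d b_i |w_i|^{q-2} x_i^2 \ge \left(\sum_{i=1}^d b_i |x_i|^q\right)^{2/q} .$$

This concludes the proof of the 1-strong convexity of $f$.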

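We now prove that the dual norm of $\left(\sum_{i=1}^d |x_i|^q b_i\right)^{1/q}$ is $\left(\sum_{i=1}^d |\theta_i|^p b_i^{1-p}\right)^{1/p}$.
By definition of dual norm,

$$\begin{aligned}
\sup_x \left\{ u^\top x : \left(\sum_{i=1}^d |x_i|^q b_i\right)^{1/q} \le 1 \right\}
&= \sup_x \left\{ u^\top x : \left(\sum_{i=1}^d \left|x_i b_i^{1/q}\right|^q\right)^{1/q} \le 1 \right\} \\
&= \sup_y \left\{ \sum_{i=1}^d u_i b_i^{-1/q} y_i : \left(\sum_{i=1}^d |y_i|^q\right)^{1/q} \le 1 \right\} \\
&= \left\| \left(u_1 b_1^{-1/q}, \dots, u_d b_d^{-1/q}\right) \right\|_p ,
\end{aligned}$$

where $1/q + 1/p = 1$. Writing the last norm explicitly and observing that
$p = q/(q-1)$,

$$\left(\sum_{i=1}^d |u_i|^p b_i^{-p/q}\right)^{1/p} = \left(\sum_{i=1}^d |u_i|^p b_i^{1-p}\right)^{1/p} ,$$

which concludes the proof. □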
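The conjugate formula proved above can be sanity-checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes the regularizer has the form $f(w) = \frac{1}{2(q-1)}\left(\sum_i b_i |w_i|^q\right)^{2/q}$ with conjugate $\frac{1}{2(p-1)}\left(\sum_i |\theta_i|^p b_i^{1-p}\right)^{2/p}$, and all function names (`f`, `f_conj`, `grad_f_conj`, `dot`) are hypothetical. It evaluates $w^\top\theta - f(w)$ at the gradient point used in the proof and compares it against random competitors $v$.

```python
import math
import random

def f(w, b, q):
    """Assumed regularizer: (sum_i b_i |w_i|^q)^(2/q) / (2 (q - 1))."""
    s = sum(bi * abs(wi) ** q for wi, bi in zip(w, b))
    return s ** (2.0 / q) / (2.0 * (q - 1.0))

def f_conj(theta, b, q):
    """Claimed conjugate: (sum_i |theta_i|^p b_i^(1-p))^(2/p) / (2 (p - 1))."""
    p = q / (q - 1.0)
    s = sum(abs(ti) ** p * bi ** (1.0 - p) for ti, bi in zip(theta, b))
    return s ** (2.0 / p) / (2.0 * (p - 1.0))

def grad_f_conj(theta, b, q):
    """Gradient of f_conj in theta; the proof uses it as the maximizer w."""
    p = q / (q - 1.0)
    s = sum(abs(ti) ** p * bi ** (1.0 - p) for ti, bi in zip(theta, b))
    scale = s ** (2.0 / p - 1.0) / (p - 1.0)
    return [scale * math.copysign(abs(ti) ** (p - 1.0), ti) * bi ** (1.0 - p)
            for ti, bi in zip(theta, b)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

rng = random.Random(0)
d, q = 5, 1.5                                   # illustrative dimension and exponent
b = [rng.uniform(0.5, 2.0) for _ in range(d)]   # positive weights
theta = [rng.gauss(0.0, 1.0) for _ in range(d)]

w = grad_f_conj(theta, b, q)
attained = dot(w, theta) - f(w, b, q)           # value at the candidate maximizer
best_random = max(
    dot(v, theta) - f(v, b, q)
    for v in ([rng.gauss(0.0, 2.0) for _ in range(d)] for _ in range(2000))
)
print(attained, f_conj(theta, b, q), best_random)
```

If the formula is right, the first two printed numbers agree up to rounding and the third never exceeds them. A random search of this kind cannot prove the supremum, it can only fail to find a counterexample.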



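Lemma 3. Assume OMD is run with functions $f_1, f_2, \dots$ defined on $X$ and such
that each $f_t$ is $\beta_t$-strongly convex with respect to the norm $\|\cdot\|_{f_t}$ and
$f_t(\lambda u) \le \lambda^2 f_t(u)$ for all $\lambda \in \mathbb{R}$ and all $u \in S$. Assume further the input sequence is
$z_t = -\eta_t \ell'_t$ for some $\eta_t > 0$, where $\ell'_t \in \partial\ell_t(w_t)$, $\ell_t(w_t) = 0$ implies $\ell'_t = 0$, and
$\ell_t = \ell(\langle\cdot, x_t\rangle, y_t)$ satisfies ||j. Then, for all $T \ge 1$,

$$\sum_{t \in \mathcal{M}\cup\mathcal{U}} \eta_t \le L_\eta + \lambda f_T(u) + \frac{1}{\lambda}\left(B + \sum_{t \in \mathcal{M}\cup\mathcal{U}} \left(\frac{\eta_t^2}{2\beta_t}\|\ell'_t\|_{f_t,*}^2 - \eta_t \langle w_t, \ell'_t\rangle\right)\right) \qquad (15)$$

for any $u \in S$ and any $\lambda > 0$, where

$$L_\eta = \sum_{t \in \mathcal{M}\cup\mathcal{U}} \eta_t \ell_t(u) \quad\text{and}\quad B = \sum_{t=1}^T \left(f_t^*(\theta_t) - f_{t-1}^*(\theta_t)\right) .$$

In particular, choosing the optimal $\lambda$, we obtain

$$\sum_{t \in \mathcal{M}\cup\mathcal{U}} \eta_t \le L_\eta + 2\sqrt{f_T(u)}\,\sqrt{B + \sum_{t \in \mathcal{M}\cup\mathcal{U}} \left(\frac{\eta_t^2}{2\beta_t}\|\ell'_t\|_{f_t,*}^2 - \eta_t \langle w_t, \ell'_t\rangle\right)} . \qquad (16)$$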

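Proof. We apply Lemma 1 with $z_t = -\eta_t \ell'_t$ and using $\lambda u$ for any $\lambda > 0$,

$$\sum_{t=1}^T \eta_t \langle \ell'_t, w_t - \lambda u\rangle \le \lambda^2 f_T(u) + \sum_{t=1}^T \left(\frac{\eta_t^2}{2\beta_t}\|\ell'_t\|_{f_t,*}^2 + f_t^*(\theta_t) - f_{t-1}^*(\theta_t)\right) .$$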
Since itiwt) — implies £[ = 0, and using M,

T 



teMuu 



Dividing by $\lambda$ and rearranging gives the first bound. The second bound is
obtained by choosing the $\lambda$ that makes equal the last two terms in the
right-hand side of (15). □
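The balancing step can be written out for generic constants. In the sketch below, $A$ and $C$ are hypothetical placeholders (not symbols from the paper) for the positive coefficients of $\lambda$ and of $1/\lambda$ in the right-hand side of (15):

```latex
\lambda A = \frac{C}{\lambda}
\quad\Longrightarrow\quad
\lambda = \sqrt{\frac{C}{A}} ,
\qquad
\lambda A + \frac{C}{\lambda} = 2\sqrt{AC} .
```

By the arithmetic-geometric mean inequality, $\lambda A + C/\lambda \ge 2\sqrt{AC}$ for every $\lambda > 0$, so equalizing the two terms also minimizes their sum.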



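Lemma 4. For all $x_1, \dots, x_T \in \mathbb{R}^d$ let $D_t = \mathrm{diag}\{A_t\}$ where $A_0 = I$ and
$A_t = A_{t-1} + \frac{1}{r} x_t x_t^\top$ for some $r > 0$. Then

$$\sum_{t=1}^T x_t^\top D_t^{-1} x_t \le r \sum_{i=1}^d \ln\left(\frac{1}{r}\sum_{t=1}^T x_{t,i}^2 + 1\right) .$$

Proof. Consider the sequence $a_t \ge 0$ and define $v_t = a_0 + \sum_{i=1}^t a_i$ with $a_0 > 0$.
The concavity of the logarithm implies $\ln b \le \ln a + \frac{b-a}{a}$ for all $a, b > 0$. Hence
we have

$$\sum_{t=1}^T \frac{a_t}{v_t} = \sum_{t=1}^T \frac{v_t - v_{t-1}}{v_t} \le \sum_{t=1}^T \ln\frac{v_t}{v_{t-1}} = \ln\frac{v_T}{v_0} = \ln\frac{a_0 + \sum_{t=1}^T a_t}{a_0} .$$

Using the above and the definition of $D_t$, we obtain

$$\sum_{t=1}^T x_t^\top D_t^{-1} x_t = \sum_{i=1}^d \sum_{t=1}^T \frac{x_{t,i}^2}{1 + \frac{1}{r}\sum_{j=1}^t x_{j,i}^2} = r \sum_{i=1}^d \sum_{t=1}^T \frac{x_{t,i}^2}{r + \sum_{j=1}^t x_{j,i}^2} \le r \sum_{i=1}^d \ln\frac{r + \sum_{t=1}^T x_{t,i}^2}{r} .$$

□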
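The inequality of Lemma 4 is easy to probe on synthetic data. The sketch below is only an illustration under stated assumptions: it takes the bound to be $\sum_t x_t^\top D_t^{-1} x_t \le r\sum_i \ln\left(\frac{1}{r}\sum_t x_{t,i}^2 + 1\right)$, tracks only the diagonal of $A_t$, and uses hypothetical helper names (`diagonal_quadratic_sum`, `log_bound`) and arbitrary Gaussian inputs.

```python
import math
import random

def diagonal_quadratic_sum(xs, r):
    """Compute sum_t x_t^T D_t^{-1} x_t with D_t = diag(A_t) and A_0 = I."""
    d = len(xs[0])
    diag = [1.0] * d                      # diagonal of A_0 = I
    total = 0.0
    for x in xs:
        for i in range(d):
            diag[i] += x[i] ** 2 / r      # diagonal of A_t = A_{t-1} + x_t x_t^T / r
            total += x[i] ** 2 / diag[i]  # D_t already includes the current x_t
    return total

def log_bound(xs, r):
    """Right-hand side: r * sum_i ln(sum_t x_{t,i}^2 / r + 1)."""
    d = len(xs[0])
    return r * sum(math.log(sum(x[i] ** 2 for x in xs) / r + 1.0) for i in range(d))

rng = random.Random(1)
xs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]

# Map each r to (left-hand side, right-hand side) of the inequality.
results = {r: (diagonal_quadratic_sum(xs, r), log_bound(xs, r)) for r in (0.1, 1.0, 10.0)}
for r, (lhs, rhs) in results.items():
    print(r, lhs, rhs)
```

Because only the diagonal of $A_t$ enters $D_t$, the update costs $O(d)$ per round, which is why the sketch never forms the full matrix.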



We conclude the appendix by proving the results required to solve the
implicit logarithmic equations of Section 6.3. We use the following lemma of
Orabona et al. (2012).

Lemma 5. Let $a, x > 0$ be such that $x \le a \ln x$. Then for all $n > 1$

$$x \le \frac{n}{n-1}\, a \ln\frac{na}{e} .$$

This allows us to prove the following easy corollaries.

Corollary 4. For all $a, b, c, d, x > 0$ such that $x \le a\ln(bx + c) + d$, we have, for all $n > 1$,

$$x \le \frac{n}{n-1}\left(a \ln\frac{nab}{e} + d\right) + \frac{c}{b}\,\frac{1}{n-1} .$$
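To see how the free parameter $n$ affects Corollary 4, one can compare the bound with the largest $x$ that actually satisfies the implicit inequality. This is a hedged sketch: the bound is assumed to read $x \le \frac{n}{n-1}\left(a\ln\frac{nab}{e} + d\right) + \frac{c}{b}\frac{1}{n-1}$, the function names and the constants are hypothetical, and the bisection assumes $c \ge 1$ so that the inequality holds at $x = 0$.

```python
import math

def corollary4_bound(a, b, c, d, n):
    """Assumed right-hand side of Corollary 4 for a given n > 1."""
    return n / (n - 1.0) * (a * math.log(n * a * b / math.e) + d) + c / (b * (n - 1.0))

def largest_solution(a, b, c, d, iters=200):
    """Largest x with x <= a*ln(b*x + c) + d, found by bisection.

    Assumes c >= 1: the gap below is then non-negative at x = 0 and, being
    concave in x, stays non-negative exactly on one interval [0, x_max].
    """
    gap = lambda x: a * math.log(b * x + c) + d - x
    lo, hi = 0.0, 1.0
    while gap(hi) >= 0.0:       # grow hi until the inequality fails
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) >= 0.0:
            lo = mid
        else:
            hi = mid
    return lo

a, b, c, d = 3.0, 2.0, 1.5, 0.5          # illustrative constants only
x_max = largest_solution(a, b, c, d)
bounds = {n: corollary4_bound(a, b, c, d, n) for n in (1.5, 2.0, 4.0, 8.0)}
print(x_max, bounds)
```

Small $n$ inflates the factor $n/(n-1)$ while large $n$ inflates the logarithm, so an intermediate value typically gives the tightest bound.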
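Corollary 5. For all $a, b, c, d, x > 0$ such that

$$x \le \sqrt{a\ln(bx + 1) + c} + d \qquad (17)$$

we have

$$x \le \sqrt{a \ln\left(\frac{2\sqrt{2}\,ab^2}{e} + 2b\sqrt{c} + 2db + 2\right) + c} + d .$$

Proof. Assumption (17) implies

$$\begin{aligned}
x^2 &\le \left(\sqrt{a\ln(bx + 1) + c} + d\right)^2 \qquad (18) \\
&\le 2a\ln(bx + 1) + 2c + 2d^2 \\
&= a\ln(bx + 1)^2 + 2c + 2d^2 \\
&\le a\ln(2b^2x^2 + 2) + 2c + 2d^2 .
\end{aligned}$$

From Corollary 4 we have that if $f, g, h, i, y > 0$ satisfy $y \le f\ln(gy + h) + i$,
then

$$y \le \frac{n}{n-1}\left(f\ln\frac{nfg}{e} + i\right) + \frac{h}{g}\,\frac{1}{n-1} \le \frac{n}{n-1}\left(\frac{nf^2g}{e^2} + i\right) + \frac{h}{g}\,\frac{1}{n-1} ,$$

where we have used the elementary inequality $\ln y \le \frac{y}{e}$ for all $y > 0$. Applying
the above to (18) we obtain

$$x^2 \le \frac{n}{n-1}\left(\frac{2na^2b^2}{e^2} + 2c + 2d^2\right) + \frac{1}{b^2}\,\frac{1}{n-1} ,$$

which implies

$$\begin{aligned}
x &\le \sqrt{\frac{n}{n-1}\left(\frac{2na^2b^2}{e^2} + 2c + 2d^2\right) + \frac{1}{b^2}\,\frac{1}{n-1}} \\
&\le \sqrt{\frac{n}{n-1}}\left(\frac{\sqrt{2n}\,ab}{e} + \sqrt{2c} + \sqrt{2}\,d\right) + \frac{1}{b}\sqrt{\frac{1}{n-1}} . \qquad (19)
\end{aligned}$$

Note that we have repeatedly used the elementary inequality $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$.
Choosing $n = 2$ and applying (19) to (17) we get

$$x \le \sqrt{a\ln(bx + 1) + c} + d \le \sqrt{a \ln\left(\frac{2\sqrt{2}\,ab^2}{e} + 2b\sqrt{c} + 2db + 2\right) + c} + d ,$$

concluding the proof. □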
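As a numerical cross-check of Corollary 5, the sketch below solves the implicit inequality (17) by bisection and compares the largest feasible $x$ with the closed-form bound. It is an illustration only: the first term inside the logarithm is taken to be $2\sqrt{2}\,ab^2/e$, which is what (19) gives for $n = 2$ and is an assumption about the damaged original display; the function names and the sampling range for $a, b, c, d$ are hypothetical.

```python
import math
import random

def largest_solution(a, b, c, d, iters=200):
    """Largest x with x <= sqrt(a*ln(b*x + 1) + c) + d, found by bisection.

    The gap below is positive at x = 0 and concave in x, so the inequality
    holds exactly on one interval [0, x_max].
    """
    gap = lambda x: math.sqrt(a * math.log(b * x + 1.0) + c) + d - x
    lo, hi = 0.0, 1.0
    while gap(hi) >= 0.0:       # grow hi until the inequality fails
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) >= 0.0:
            lo = mid
        else:
            hi = mid
    return lo

def corollary5_bound(a, b, c, d):
    """Assumed closed-form bound, using (19) with n = 2 inside the logarithm."""
    inner = 2.0 * math.sqrt(2.0) * a * b * b / math.e + 2.0 * b * math.sqrt(c) + 2.0 * d * b + 2.0
    return math.sqrt(a * math.log(inner) + c) + d

rng = random.Random(2)
cases = [tuple(rng.uniform(0.1, 10.0) for _ in range(4)) for _ in range(100)]
checks = [(largest_solution(*p), corollary5_bound(*p)) for p in cases]
print(max(x / bound for x, bound in checks))   # ratio stays at or below 1 if the bound holds
```

The printed ratio indicates how loose the explicit bound is on these random instances, which gives a rough sense of the price paid for removing the implicit dependence on $x$.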